Starbucks Capstone Challenge

A capstone project as part of my Udacity Data Scientist Nanodegree Program

Project Overview

The objective of this project is to analyze in detail the simulated data provided by Udacity (a simplified version of the real Starbucks app data) that mimics customer behavior on the Starbucks rewards mobile app, and to build a machine learning model that predicts whether a customer will respond to an offer.

Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Also, not all users receive the same offer, and that is the challenge to solve with this data set.

The goal here is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. Note that in the data set, informational offers also have a validity period even though these ads merely provide information about a product; for example, if an informational offer has 7 days of validity, it can be assumed the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

Transactional data is also provided. It contains user purchases made on the app, including the timestamp of each purchase and the amount of money spent. This transactional data also has a record for each offer that a user receives, a record for when a user actually views the offer, and a record for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

Example

To give an example, a user could receive a "buy 10 dollars get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

Problem Statement

Given the data sets below, the objective is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer reward, and to predict which users, who normally wouldn't make a purchase, would respond to a sent offer and make a purchase through it. This can be done by first answering the questions below:

  1. What percentage of customers view offers, and what are their characteristics?
  2. What percentage of customers respond to offers, and what percentage complete the offer after viewing it?
  3. How much do advertisements contribute to user transactions?
  4. Can we predict whether a customer will respond to an offer using demographics and offer reward data?

Metrics

The F1 score is chosen as the metric because it is the harmonic mean of precision and recall and therefore takes both into account.
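As a quick illustration of the metric, F1 = 2 * precision * recall / (precision + recall). The sketch below computes it from raw counts. The helper name `f1_from_counts` and the counts are made up for illustration and are not results from this project.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only
score = f1_from_counts(tp=60, fp=20, fn=40)
print(round(score, 3))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.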

Data Understanding

Import required libraries

Read the Data Sets

The data is contained in three files: portfolio.json, profile.json and transcript.json.

Let's understand each of these data sets in detail.
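A minimal sketch of loading one of these files with pandas, assuming the files are line-delimited JSON (one record per line). That layout, the sample records and the column names below are assumptions for illustration, not taken from the project data.

```python
import io

import pandas as pd

# Two made-up records standing in for a line-delimited JSON file
sample = io.StringIO(
    '{"id": "offer_a", "offer_type": "bogo", "duration": 5}\n'
    '{"id": "offer_b", "offer_type": "discount", "duration": 10}\n'
)

# With a real file, pass its path instead of the in-memory buffer
portfolio = pd.read_json(sample, orient="records", lines=True)
print(portfolio.shape)
print(portfolio.dtypes)
```

The same call would be repeated for the other two files, giving one data frame per data set.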

Understanding offer data from portfolio data set

portfolio.json

The data set contains 3 different offer types: BOGO, discount and informational (advertisement).

Understanding Demographic data from profile data set

Demographic data for customers is provided in the profile dataset. The schema and variables are as follows:

profile.json

Understanding transcript data set

The schema for the transactional data is as follows:

transcript.json

There are 4 unique events in the data set: transaction, offer received, offer viewed and offer completed.

The number of people in the transcript data set is the same as the number of people in the demographic data, which makes it easy to combine the data sets.
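Matching counts alone do not guarantee that both data sets contain the same people, so comparing the ID sets is a slightly stronger check. The sketch below uses made-up miniature frames; the column name `user_id` is an assumption.

```python
import pandas as pd

# Made-up miniature frames; the column name is an assumption
profile = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
transcript = pd.DataFrame({"user_id": ["u1", "u2", "u2", "u3"]})

profile_ids = set(profile["user_id"])
transcript_ids = set(transcript["user_id"])

same_count = len(profile_ids) == len(transcript_ids)
same_users = profile_ids == transcript_ids
print(same_count, same_users)
```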

Data Preparation

Now that we understand the data sets, let's proceed to the next stage, i.e. data preparation/data wrangling of all data sets.

Data Wrangling of Portfolio data set

Handling of categorical variables
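One common way to handle a categorical column is one-hot encoding. The sketch below applies `pd.get_dummies` to a made-up offer table; the column names and values are assumptions, and the encoding used in the original notebook may differ.

```python
import pandas as pd

# Made-up offer table; column names and values are assumptions
portfolio = pd.DataFrame({
    "offer_id": ["a", "b", "c"],
    "offer_type": ["bogo", "discount", "informational"],
})

# Replace the categorical column with one 0/1 indicator column per category
encoded = pd.get_dummies(portfolio, columns=["offer_type"], dtype=int)
print(encoded)
```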

Data Wrangling of profile data set

Data Wrangling of transcript data set

Data Exploration

Let's perform some data exploration using various visualization techniques to gain insights about the data sets.

Find out the distribution of our users based on their age and gender.

There is a big spike near age 120. The chart shows that an unusually large number of customer entries in the data set have an age of around 120 years.

This is because some users selected their gender as O (other than male or female), and all of them have an age of exactly 118.

The income entries for those people are also null in the data set, as shown below.

It's concluded that there are 2175 users of age 118 who prefer not to share personal info such as gender and income. So let's separate those users into a new data frame and add a new column that flags each user: 0 if the user provided personal info, 1 otherwise.
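The flagging step could look roughly like this. The sample rows, the column names and the flag name `info_missing` are hypothetical, and the sketch assumes missing gender and income are stored as nulls.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumptions
profile = pd.DataFrame({
    "age": [118, 55, 118, 62],
    "gender": [None, "F", None, "M"],
    "income": [np.nan, 72000.0, np.nan, 51000.0],
})

# 1 when both gender and income are missing, 0 when the user provided info
missing = profile["gender"].isna() & profile["income"].isna()
profile["info_missing"] = missing.astype(int)

# Separate data frames for the two groups
anonymous_users = profile[profile["info_missing"] == 1]
known_users = profile[profile["info_missing"] == 0]
print(len(anonymous_users), len(known_users))
```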

Customer Distribution based on user info provided

Let's repeat the exploration of the age distribution by excluding the users who didn't provide any user info.

Distribution of age for the customers for which user info is available

From the above visualization, which covers only customers whose user info is available, the majority of customers in the data set are in their late 50s or early 60s, and the number of customers decreases as we move away from the peak, i.e. after the age of 65.

Next let's look at the gender distribution after filtering out the customers who didn't provide any personal info.

After filtering out the customers who have not provided any personal info, it's observed that there are 8484 male customers, which is 57.23% of the customers who provided personal info. The share of female customers is a little under 16 percentage points lower than that of males, and other categories account for 1.43%.

Now let's look at the distribution of user incomes for customers who provided personal info.

After filtering out the customers who have not provided any personal info, it's observed that many users have an annual income between 30000 USD and 50000 USD, and the majority of customers have an income between 50000 USD and 75000 USD. The number of users decreases as income increases, meaning fewer users fall into the high income ranges.

Now let's start exploring the transcript data set.

It's clearly visible that the events in the transcript data set fall into two kinds: offer events and transactions.

Almost 55% of the records in the transcript data set are offer-related events: 24.92% are offer received, 18.86% are offer viewed and 10.84% are offer completed. The remaining 45% of records are transactions.
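Shares like these can be obtained with a normalized value count. The sketch below runs on a made-up event column whose proportions are invented for illustration and do not reproduce the figures above; the column name `event` is an assumption.

```python
import pandas as pd

# Made-up event log with 20 rows; proportions are illustrative only
transcript = pd.DataFrame({
    "event": ["transaction"] * 9
    + ["offer received"] * 5
    + ["offer viewed"] * 4
    + ["offer completed"] * 2
})

# Percentage of records per event type
event_share = transcript["event"].value_counts(normalize=True).mul(100).round(2)

# Combined share of the three offer-related events
offer_share = event_share.drop("transaction").sum()
print(event_share)
print(offer_share)
```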

Hence it can be deduced that not all customers who received the offer viewed it and not all customers who viewed the offer completed it.

Here is the percentage of offers viewed out of all the offers sent to customers.

This raises the following questions again.

1. Which transactions were completed because of the sent offers?

2. What are the characteristics of users who completed the offers?

To answer those questions let's continue our data exploration and analysis. Let's begin by renaming columns and merging the data sets.

The reward_given and offer_reward columns are identical for all records with a completed offer (offer_completed==1), so let's drop the reward_given column.

Now let's look at the distribution of the offers sent to Customers

From the visualization, we can conclude almost same number of BOGO (Buy 1 Get 1) and Discount offers are sent to users which is almost double the number of Advertisement offers sent.

Now let's select records where transactions occurred after receiving an offer and while that offer was valid.

To achieve this let's divide the data frame into 3 groups based on offer received, offer viewed and offer completed. Then merge the data frames and add an offer_expiry column that holds the deadline for each offer.

Now we merge those dataframes on user_id and offer_id columns, to be able to compare the time between transactions.
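The split, merge and expiry steps above can be sketched as follows. Apart from user_id, offer_id and offer_expiry, the column names are hypothetical, and the sketch assumes event times are recorded in hours and offer durations in days.

```python
import pandas as pd

# Made-up event log; time in hours and duration in days are assumed units
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "offer_id": ["o1", "o1", "o1", "o1", "o1"],
    "event": ["offer received", "offer viewed", "offer completed",
              "offer received", "offer completed"],
    "time": [0, 6, 30, 0, 48],
    "duration": [5, 5, 5, 5, 5],
})


def pick(event_name: str, time_col: str, extra=()) -> pd.DataFrame:
    """Return the rows for one event type with the time column renamed."""
    cols = ["user_id", "offer_id", "time", *extra]
    subset = events.loc[events["event"] == event_name, cols]
    return subset.rename(columns={"time": time_col})


received = pick("offer received", "time_received", extra=["duration"])
viewed = pick("offer viewed", "time_viewed")
completed = pick("offer completed", "time_completed")

# Deadline of each offer, in hours
received["offer_expiry"] = received["time_received"] + received["duration"] * 24

# Left merges keep offers that were never viewed or never completed
merged = (
    received
    .merge(viewed, on=["user_id", "offer_id"], how="left")
    .merge(completed, on=["user_id", "offer_id"], how="left")
)
print(merged)
```

Using left merges means a missing view or completion shows up as a null time instead of silently dropping the received offer.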

Let's ensure no user completed an offer twice in the same hour. This allows us to remove duplicate rows in the merged data frame.

Let's ensure no user received an offer twice in the same hour. This allows us to remove duplicate rows in the merged data frame.

Now that we have removed the duplicate entries, let's select users who completed the offer after viewing it and before it expired.
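A minimal sketch of the duplicate check and removal on a made-up merged frame. The time column names are hypothetical.

```python
import pandas as pd

# Made-up merged frame containing one repeated row
merged = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "offer_id": ["o1", "o1", "o1"],
    "time_received": [0, 0, 0],
    "time_completed": [30, 30, 48],
})

# Rows that agree on all key columns are treated as duplicates
key = ["user_id", "offer_id", "time_received", "time_completed"]
n_duplicates = int(merged.duplicated(subset=key).sum())

deduplicated = merged.drop_duplicates(subset=key).reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```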
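The selection can be expressed as a boolean filter over the event times. The frame below is made up and the time column names are hypothetical; a null view time means the offer was never viewed.

```python
import numpy as np
import pandas as pd

# Made-up merged frame; times are in hours (assumed unit)
merged = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "time_received": [0, 0, 0, 0],
    "time_viewed": [6, np.nan, 40, 10],
    "time_completed": [30, 48, 30, 130],
    "offer_expiry": [120, 120, 120, 120],
})

# Viewed after receipt, completed after the view, and completed before expiry.
# Comparisons with NaN evaluate to False, so never-viewed offers drop out.
influenced = merged[
    (merged["time_viewed"] >= merged["time_received"])
    & (merged["time_viewed"] <= merged["time_completed"])
    & (merged["time_completed"] <= merged["offer_expiry"])
]
print(influenced["user_id"].tolist())
```

In this toy data only u1 qualifies: u2 never viewed the offer, u3 viewed it after completing it, and u4 completed it after it expired.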

Let's compare this result with the original number of people who completed an offer through a visualization. This would clarify the number of people who completed an offer after viewing it versus those who completed without viewing it.

Based on the pie chart it's evident that 71% of completed offers were made after users viewed them.

Let's look at the effect of advertisement on the transactions that users have made.

Finally, let's combine both dataframes, to see overall effect of an offer, whether it's BOGO, Discount or Advertisement on the behaviour of the user.

So there are 33081 transactions, i.e. 19.22% of the total transactions, that are influenced by the offers.

Now let's dive into the characteristics of users who are influenced by offers.

First let's see how gender affects the amount spent in a transaction and which offer reward attracts which gender, and plot the gender-wise average amount spent for the customers whose personal info is available.

From the bar chart, it's observed that females spend 3.99 USD on average, which is more than the average spend of males, i.e. 3.48 USD. The other category seems to spend the most, with an average of 4.35 USD. This is probably due to their low number compared to males and females.

The visualization on the right displays how each gender responds to offer rewards. Males seem to respond more to the 2.0 USD, 3.0 USD and 5.0 USD offer rewards than females, whereas females respond more to the 10.0 USD offer reward than males.

For the other categories, no conclusion can be drawn because of their small numbers; however, the visualization suggests they respond slightly more to the 5.0 USD reward.
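Per-gender averages of this kind come from a group-by aggregation. The sketch below uses made-up amounts, not the project data, and reports the group size next to the mean, since a small group makes its average less reliable. The column names are assumptions.

```python
import pandas as pd

# Made-up transactions; amounts are illustrative only
transactions = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "O"],
    "amount": [4.0, 6.0, 3.0, 5.0, 7.0],
})

# Mean amount per gender together with the number of transactions behind it
avg_spend = transactions.groupby("gender")["amount"].agg(["mean", "count"])
print(avg_spend)
```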

A similar representation of the gender behaviour towards offer response is plotted using a Heatmap.

Let's see the spread of the amount spent by users who completed the offer after viewing it and before it expired, and by users who didn't complete the offer.

From the left box plot, it's observed that customers who didn't provide any personal info tend to spend more per transaction for completed offers. However, from the right box plot, it's observed that customers who provided personal info tend to spend a lot more.

Since there are many outliers in the amount spent by customers, the y axis is limited to show the trend where most of the data lies.

Let's also find the correlation between age groups and the amount spent by customers for transactions with completed offers and without offers.

From the above scatter plot with fitted regression line, it's observed that there is a roughly positive correlation between the average amount spent and age group for both completed and uncompleted offers, with the data densest between the late 50s and late 60s. Older people tend to spend more per transaction in both cases.

Since there are many outliers in the amount spent by customers, the y axis is limited to show the trend where most of the data lies.

Now let's see how customers responded to offers across the different channels used.

First filter the data into two categories: channels used for completed offers, and channels used for offers that were not completed.

We can see that users responded to the offers at almost the same rate across all channels used to send them.

Let's find the correlation between customer income and the amount spent by customers for transactions with completed offers and without offers.

We can see that the correlation between customer income and amount spent is positive for transactions with both completed and not completed offers. The amount spent per transaction is higher when the user income is higher, which is expected.

Since there are many outliers in the amount spent by customers, the y axis is limited to show the trend where most of the data lies.

Next, let's see how the age plays a role in responding to offer rewards.

Based on the above visualization, it's noticeable that age doesn't play a big role in responding to offer rewards, meaning people of all age groups respond to the offers in a similar fashion.

Data Modeling

Predict User Response

Now that we are done with the data exploration and analysis to answer the questions, let's build a model to predict whether a user will respond to an offer or not. To achieve this, let's assign the target variable to be predicted, i.e. the offer_success column, to y, and assign the features, i.e. the user demographics variables, to X.

Let's use RandomForestClassifier and LinearSVC to build our models and predict the user response.
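A rough sketch of this modeling setup with scikit-learn, run on synthetic data. The features, the target rule, the split ratio and all hyperparameters below are assumptions for illustration and are not the settings or results of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for age, income and offer reward (not project data)
X = np.column_stack([
    rng.integers(18, 90, size=400),
    rng.integers(30_000, 120_000, size=400),
    rng.choice([2, 3, 5, 10], size=400),
])
# Synthetic target loosely tied to income, only so the models have a signal
y = (X[:, 1] + rng.normal(0, 20_000, size=400) > 75_000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Each pipeline fits the scaler on the training data only
models = {
    "random_forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)
    ),
    "linear_svc": make_pipeline(
        StandardScaler(), LinearSVC(random_state=42, max_iter=5000)
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Putting the scaler inside a pipeline keeps the test set out of the scaling statistics, which avoids leaking information from the test data into training.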

Scale features

Train Our Classifier

Modeling using RandomForestClassifier

Feature importance by RandomForestClassifier

Predict test data using RandomForestClassifier

Data Modeling Using LinearSVC

Predict test data using LinearSVC

Model Evaluation

We use the classification report from sklearn to evaluate the model.
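For reference, the report can be produced as in the sketch below. The labels are made up for illustration and are not this project's predictions; the class names passed to `target_names` are also assumptions.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels: 1 = responded to the offer, 0 = did not respond
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Per-class precision, recall and F1, plus overall accuracy
report = classification_report(
    y_true, y_pred, target_names=["no response", "response"]
)
print(report)

# F1 for the positive class only
f1_response = f1_score(y_true, y_pred, pos_label=1)
```

The report lists an F1 score per class, which is why the results are quoted separately for users who respond and users who do not.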

Model Evaluation for RandomForestClassifier

Based on the model evaluation, the performance of our RandomForestClassifier model prediction is as follows:

Model Evaluation for LinearSVC

Based on the model evaluation, the performance of our LinearSVC model prediction is as follows:

Justification

Based on the analysis and exploration done, this project identifies how customer demographics and the offer reward affect user response to the offers or advertisements sent.

First, we identified users for whom demographic information was missing and classified them into a separate group. 13% of all users preferred not to share their personal info. Separating them helped us accurately identify the gender distribution in the data set. We saw that males make up 57% of the remaining users and females make up 41%, leaving 1% for others.

Next we observed that the majority of users are in their late 50s or early 60s, and that the number of users decreases as we move away from the peak. Age doesn't affect how strongly users are attracted to particular offer rewards. However, it was seen that the average amount spent per transaction increases as user age increases.

Many users have an annual income between 30000 USD and 50000 USD, but the majority are in the range between 50000 USD and 75000 USD, and fewer users have a high annual income. The amount spent per transaction is higher when the user income is higher, which is expected.

After that we saw that 75% of all the received offers were actually viewed, and that 71% of completed offers were completed after the user viewed them, leaving 29% completed unintentionally.

It was seen that offers influence 19% of the total transactions or completed offers that occurred, which is a substantial share, and that users are 7% more likely to respond to offers when they are sent through social media.

We saw how gender plays a role in the average amount spent by a user, and also in which type of offer reward users respond to. Males responded more to the 2, 3 and 5 dollar rewards while females responded more to the 10 dollar rewards, and on average, females spend more than males.

People who chose to stay anonymous tend to spend more per transaction in the group that responded to offers. For the other group it's the opposite: known users spend a lot more than anonymous users.

Finally, we built a model that predicts whether a user will respond to an offer or not based on demographics and offer reward. The model predicted this with an accuracy of 87%, an F1-score of 0.65 for identifying those who will respond to offers, and an F1-score of 0.92 for those who won't.

Conclusion

Reflection

The problem that I chose to solve as part of this project is to build a model that predicts whether a customer will respond to an offer or not. The approach used to solve this problem has three main steps.

The most interesting aspect of this project is the combination of different data sets with predictive modeling techniques and analysis to provide better decisions and value to the business. The data exploration and wrangling steps were the longest and most challenging part. The toughest part of the entire analysis was finding the right logic and strategy to answer the problem statements and conveying the answers with different visualization techniques.

Improvement